18 research outputs found
Efficient Modeling of Surrogates to Improve Multi-source High-dimensional Biobank Studies
Surrogate variables in electronic health records (EHR) and biobank data play
an important role in biomedical studies due to the scarcity or absence of
chart-reviewed gold standard labels. We develop a novel approach named SASH for
Surrogate-Assisted and data-Shielding High-dimensional
integrative regression. It is a semi-supervised approach that efficiently
leverages sizable unlabeled samples with error-prone EHR surrogate outcomes
from multiple local sites, to improve the learning accuracy of the small
gold-labeled data. To facilitate stable and efficient knowledge extraction
from the surrogates, our method first obtains a preliminary supervised
estimator, and then uses it to assist in training a regularized single index model
(SIM) for the surrogates. Interestingly, through a chain of convex and properly
penalized sparse regressions that approximate the SIM loss with
bias correction, our method avoids the local minima issue of the SIM training,
and fully eliminates the impact of the preliminary estimator's large error. In
addition, it protects individual-level information through
summary-statistics-based data aggregation across the local sites, leveraging a
similar idea of bias-corrected approximation for SIM. Through simulation
studies, we demonstrate that our method outperforms existing approaches on
finite samples. Finally, we apply our method to develop a high-dimensional
genetic risk model for type II diabetes using large-scale data sets from UK and
Mass General Brigham biobanks, where only a small fraction of subjects in one
site has been labeled via chart review.
Doubly Robust Augmented Model Accuracy Transfer Inference with High Dimensional Features
Due to label scarcity and covariate shift happening frequently in real-world
studies, transfer learning has become an essential technique to train models
generalizable to some target populations using existing labeled source data.
Most existing transfer learning research has been focused on model estimation,
while there is a paucity of literature on transfer inference for model accuracy
despite its importance. We propose a novel Doubly Robust
Augmented Model Accuracy Transfer
Inference (DRAMATIC) method for point and interval
estimation of commonly used classification performance measures in an unlabeled
target population using labeled source data. Specifically, DRAMATIC derives and
evaluates the risk model for a binary response against some low-dimensional
predictors on the target population, leveraging the response from source
data only and high-dimensional adjustment features from both the
source and target data. The proposed estimators are doubly robust in the sense
that they are consistent when at least one model is correctly
specified and certain model sparsity assumptions hold. Simulation results
demonstrate that the point estimates have negligible bias and the confidence
intervals derived by DRAMATIC attain satisfactory empirical coverage levels. We
further illustrate the utility of our method to transfer the genetic risk
prediction model and its accuracy evaluation for type II diabetes across two
patient cohorts in Mass General Brigham (MGB) collected using different
sampling mechanisms and at different time points.
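The double-robustness idea mentioned in this abstract can be illustrated with a generic textbook construction. The sketch below is not the DRAMATIC estimator itself (which targets classification accuracy measures with high-dimensional nuisance models); it is a minimal augmented estimator of a target-population mean under covariate shift. The function name `dr_target_mean` and all of its arguments are hypothetical, and both the outcome-model predictions and the density-ratio weights are assumed to be supplied by the caller.

```python
def dr_target_mean(y_src, m_src, w_src, m_tgt):
    """Augmented (doubly robust) estimate of the target-population mean of Y.

    y_src: observed labels on the source sample
    m_src: outcome-model predictions for the source sample
    w_src: density-ratio weights (target density / source density), source sample
    m_tgt: outcome-model predictions for the unlabeled target sample
    All inputs are illustrative; none come from the paper.
    """
    n_src, n_tgt = len(y_src), len(m_tgt)
    # Plug-in term: average the outcome model over the target covariates.
    plug_in = sum(m_tgt) / n_tgt
    # Augmentation term: weighted source residuals correct the outcome model.
    correction = sum(w * (y - m) for w, y, m in zip(w_src, y_src, m_src)) / n_src
    return plug_in + correction
```

If the outcome model is correct, the residual term averages to zero; if the weights are correct, the residual term recovers the outcome model's bias on the target. Either way the estimate stays consistent, which is the sense of "doubly robust" used above.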
Assessing Heterogeneous Risk of Type II Diabetes Associated with Statin Usage: Evidence from Electronic Health Record Data
There have been increased concerns that the use of statins, one of the most
commonly prescribed drugs for treating coronary artery disease, is potentially
associated with the increased risk of new-onset type II diabetes (T2D).
However, because existing clinical studies with limited sample sizes often
suffer from selection bias issues, there is no robust evidence as to
whether, and which, populations are indeed vulnerable to developing T2D
after taking statins. In this case study, building on the biobank and
electronic health record data in the Partner Health System, we introduce a new
data analysis pipeline from a biological perspective and a novel statistical
methodology that address the limitations in existing studies to: (i)
systematically examine heterogeneous treatment effects of statin use on T2D
risk, (ii) uncover which patient subgroup is most vulnerable to T2D after
taking statins, and (iii) assess the replicability and statistical significance
of the most vulnerable subgroup via bootstrap calibration. Our proposed
bootstrap calibration approach delivers asymptotically sharp confidence
intervals and debiased estimates for the treatment effect of the most
vulnerable subgroup in the presence of possibly high-dimensional covariates. By
implementing our proposed approach, we find that females with high T2D genetic
risk at baseline are indeed at high risk of developing T2D due to statin use,
which provides evidence to support future clinical decisions with respect to
statin use. Comment: 31 pages, 2 figures, 6 tables
Assessing the Most Vulnerable Subgroup to Type II Diabetes Associated with Statin Usage: Evidence from Electronic Health Record Data
There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with the increased risk of new-onset type II diabetes (T2D). Nevertheless, to date, there is no robust evidence as to whether, and which, populations are indeed vulnerable to developing T2D after taking statins. In this case study, leveraging the biobank and electronic health record data in the Partner Health System, we introduce a new data analysis pipeline and a novel statistical methodology that address existing limitations by (i) designing a rigorous causal framework that systematically examines the causal effects of statin usage on T2D risk in observational data, (ii) uncovering which patient subgroup is most vulnerable to developing T2D after taking statins, and (iii) assessing the replicability and statistical significance of the most vulnerable subgroup via a bootstrap calibration procedure. Our proposed approach delivers asymptotically sharp confidence intervals and debiased estimates for the treatment effect of the most vulnerable subgroup in the presence of high-dimensional covariates. With our proposed approach, we find that females with high T2D genetic risk are at the highest risk of developing T2D due to statin usage.
Spatial distribution and potential sources of arsenic and water-soluble ions in the snow at Ili River Valley, China
Trace elements and water-soluble ions in snow can be used as indicators to reveal natural and anthropogenic emissions. To understand the chemical composition and characteristics of snow, and their potential sources, in the Ili River Valley (IRV), snow samples were collected from 17 sites in the IRV from December 2018 to March 2019. Inverse distance weighting, enrichment factor (EF) analysis, and backward trajectory modelling were applied to evaluate the spatial distributions and sources of water-soluble ions and dissolved arsenic (As) in snow. The results indicate that Ca²⁺ and SO₄²⁻ were the dominant ions, and the concentrations of As ranged from 0.09 to 0.503 μg L⁻¹. High concentrations of As were distributed in the northwest and middle of the IRV, and the concentrations of the major ions were high in the west of the IRV. The strong correlation of As with F⁻, SO₄²⁻, and NO₂⁻ demonstrates that As mainly originated from coal-burning and agricultural activities. Principal component analysis showed that the ions originated from a combination of anthropogenic and crustal sources. The EFs showed that K⁺, SO₄²⁻, and Mg²⁺ were mainly influenced by human activities. Backward trajectory cluster analysis suggested that the chemical composition of snow was affected by soil dust transport from the western air mass, the unique terrain, and local anthropogenic activities. These results provide important scientific insights for atmospheric environmental management and agricultural production within the IRV.
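Two of the techniques named in this abstract have simple closed forms, sketched here in their standard textbook versions. The abstract does not state the distance power used for interpolation or the reference element used for the enrichment factor, so the default `power=2.0`, the function names, and the argument names below are all assumptions for illustration only.

```python
import math

def idw(x, y, sites, power=2.0):
    """Inverse distance weighted estimate at point (x, y).

    sites: list of (xi, yi, value) tuples for the sampled locations.
    power: distance exponent (2.0 is a common default, assumed here).
    """
    num = 0.0
    den = 0.0
    for xi, yi, value in sites:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            # The query point coincides with a sampled site: return its value.
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def enrichment_factor(c_sample, ref_sample, c_background, ref_background):
    """EF = (C / Ref)_sample / (C / Ref)_background.

    c_*: concentration of the ion of interest;
    ref_*: concentration of a reference element (hypothetical choice).
    """
    return (c_sample / ref_sample) / (c_background / ref_background)
```

An EF well above 1 is conventionally read as enrichment relative to the background source, which is the reasoning behind attributing some ions to human activities.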
Changes in laboratory value improvement and mortality rates over the course of the pandemic: an international retrospective cohort study of hospitalised patients infected with SARS-CoV-2
Objective: To assess changes in international mortality rates and laboratory recovery rates during hospitalisation for patients hospitalised with SARS-CoV-2 between the first wave (1 March to 30 June 2020) and the second wave (1 July 2020 to 31 January 2021) of the COVID-19 pandemic. Design, setting and participants: This is a retrospective cohort study of 83 178 hospitalised patients admitted between 7 days before and 14 days after PCR-confirmed SARS-CoV-2 infection within the Consortium for Clinical Characterization of COVID-19 by Electronic Health Record, an international multihealthcare system collaborative of 288 hospitals in the USA and Europe. The laboratory recovery rates and mortality rates over time were compared between the two waves of the pandemic. Primary and secondary outcome measures: The primary outcome was all-cause mortality rate within 28 days after hospitalisation stratified by predicted low, medium and high mortality risk at baseline. The secondary outcome was the average rate of change in laboratory values during the first week of hospitalisation. Results: Baseline Charlson Comorbidity Index and laboratory values at admission were not significantly different between the first and second waves. The improvement in laboratory values over time was faster in the second wave compared with the first. The average C-reactive protein rate of change was –4.72 mg/dL vs –4.14 mg/dL per day (p=0.05). The mortality rates within each risk category significantly decreased over time, with the most substantial decrease in the high-risk group (42.3% in March–April 2020 vs 30.8% in November 2020 to January 2021, p<0.001) and a moderate decrease in the intermediate-risk group (21.5% in March–April 2020 vs 14.3% in November 2020 to January 2021, p<0.001).
Conclusions: Admission profiles of patients hospitalised with SARS-CoV-2 infection did not differ greatly between the first and second waves of the pandemic, but there were notable differences in laboratory improvement rates during hospitalisation. Mortality risks among patients with similar risk profiles decreased over the course of the pandemic. The improvement in laboratory values and mortality risk was consistent across multiple countries.